Goto

Collaborating Authors

 uci machine learning repository


A principled approach for data bias mitigation

AIHub

How do you know if your data is fair? And if it isn't, what can you do about it? Machine learning models are increasingly used to make high-stakes decisions, from predicting who gets a loan to estimating the likelihood that someone will reoffend. But these models are only as good as the data they learn from [Shahbazi 2023]. If the training data is biased, the model's decisions will likely be biased too [Hort 2024, Pagano 2023].


Using Noise to Infer Aspects of Simplicity Without Learning Zachery Boner 1 Harry Chen

Neural Information Processing Systems

Noise in data significantly influences decision-making in the data science process. In fact, it has been shown that noise in data generation processes leads practitioners to find simpler models. However, an open question still remains: what is the degree of model simplification we can expect under different noise levels? In this work, we address this question by investigating the relationship between the amount of noise and model simplicity across various hypothesis spaces, focusing on decision trees and linear models. We formally show that noise acts as an implicit regularizer for several different noise models. Furthermore, we prove that Rashomon sets (sets of near-optimal models) constructed with noisy data tend to contain simpler models than corresponding Rashomon sets with non-noisy data. Additionally, we show that noise expands the set of "good" features and consequently enlarges the set of models that use at least one good feature. Our work offers theoretical guarantees and practical insights for practitioners and policymakers on whether simple-yet-accurate machine learning models are likely to exist, based on knowledge of noise levels in the data generation process.



Feature Learning for Interpretable, Performant Decision Trees

Neural Information Processing Systems

Points were sampled uniformly in the bands denoted by dashed lines. We posit that these barriers are due, at least in part, to the sensitivity of decision trees to transformations of the input resulting from greedy construction and simple decision rules. Of these, key limitation is the latter; even if we replace greedy construction with a perfect tree learner, simple distributions can nonetheless require an arbitrarily large axis-aligned tree to fit.


A benchmark of categorical encoders for binary classification

Neural Information Processing Systems

Categorical encoders transform categorical features into numerical representations that are indispensable for a wide range of machine learning models. Existing encoder benchmark studies lack generalizability because of their limited choice of 1. encoders, 2. experimental factors, and 3. datasets. Additionally, inconsistencies arise from the adoption of varying aggregation strategies. This paper is the most comprehensive benchmark of categorical encoders to date, including an extensive evaluation of 32 configurations of encoders from diverse families, with 48 combinations of experimental factors, and on 50 datasets. The study shows the profound influence of dataset selection, experimental factors, and aggregation strategies on the benchmark's conclusions -- aspects disregarded in previous encoder benchmarks.




RefiDiff: Progressive Refinement Diffusion for Efficient Missing Data Imputation

Ahamed, Md Atik, Ye, Qiang, Cheng, Qiang

arXiv.org Artificial Intelligence

Missing values in high-dimensional, mixed-type datasets pose significant challenges for data imputation, particularly under Missing Not At Random (MNAR) mechanisms. Existing methods struggle to integrate local and global data characteristics, limiting performance in MNAR and high-dimensional settings. We propose an innovative framework, RefiDiff, combining local machine learning predictions with a novel Mamba-based denoising network efficiently capturing long-range dependencies among features and samples with low computational complexity. RefiDiff bridges the predictive and generative paradigms of imputation, leveraging pre-refinement for initial warm-up imputations and post-refinement to polish results, enhancing stability and accuracy. By encoding mixed-type data into unified tokens, RefiDiff enables robust imputation without architectural or hyperparameter tuning. RefiDiff outperforms state-of-the-art (SOT A) methods across missing-value settings, demonstrating strong performance in MNAR settings and superior out-of-sample generalization. Extensive evaluations on nine real-world datasets demonstrate its robustness, scalability, and effectiveness in handling complex missingness patterns.


MLPrE -- A tool for preprocessing and exploratory data analysis prior to machine learning model construction

Maxwell, David S, Darkoh, Michael, Samudrala, Sidharth R, Chung, Caroline, Schmidt, Stephanie T, Al-Lazikani, Bissan

arXiv.org Artificial Intelligence

With the recent growth of Deep Learning for AI, there is a need for tools to meet the demand of data flowing into those models. In some cases, source data may exist in multiple formats, and therefore the source data must be investigated and properly engineered for a Machine Learning model or graph database. Overhead and lack of scalability with existing workflows limit integration within a larger processing pipeline such as Apache Airflow, driving the need for a robust, extensible, and lightweight tool to preprocess arbitrary datasets that scales with data type and size. To address this, we present Machine Learning Preprocessing and Exploratory Data Analysis, MLPrE, in which SparkDataFrames were utilized to hold data during processing and ensure scalability. A generalizable JSON input file format was utilized to describe stepwise changes to that DataFrame. Stages were implemented for input and output, filtering, basic statistics, feature engineering, and exploratory data analysis. A total of 69 stages were implemented into MLPrE, of which we highlight and demonstrate key stages using six diverse datasets. We further highlight MLPrE's ability to independently process multiple fields in flat files and recombine them, otherwise requiring an additional pipeline, using a UniProt glossary term dataset. Building on this advantage, we demonstrated the clustering stage with available wine quality data. Lastly, we demonstrate the preparation of data for a graph database in the final stages of MLPrE using phosphosite kinase data. Overall, our MLPrE tool offers a generalizable and scalable tool for preprocessing and early data analysis, filling a critical need for such a tool given the ever expanding use of machine learning. This tool serves to accelerate and simplify early stage development in larger workflows.


RS-ORT: A Reduced-Space Branch-and-Bound Algorithm for Optimal Regression Trees

Heredia, Cristobal, Chumpitaz-Flores, Pedro, Hua, Kaixun

arXiv.org Artificial Intelligence

Mixed-integer programming (MIP) has emerged as a powerful framework for learning optimal decision trees. Yet, existing MIP approaches for regression tasks are either limited to purely binary features or become computationally intractable when continuous, large-scale data are involved. Naively binarizing continuous features sacrifices global optimality and often yields needlessly deep trees. We recast the optimal regression-tree training as a two-stage optimization problem and propose Reduced-Space Optimal Regression Trees (RS-ORT) - a specialized branch-and-bound (BB) algorithm that branches exclusively on tree-structural variables. This design guarantees the algorithm's convergence and its independence from the number of training samples. Leveraging the model's structure, we introduce several bound tightening techniques - closed-form leaf prediction, empirical threshold discretization, and exact depth-1 subtree parsing - that combine with decomposable upper and lower bounding strategies to accelerate the training. The BB node-wise decomposition enables trivial parallel execution, further alleviating the computational intractability even for million-size datasets. Based on the empirical studies on several regression benchmarks containing both binary and continuous features, RS-ORT also delivers superior training and testing performance than state-of-the-art methods. Notably, on datasets with up to 2,000,000 samples with continuous features, RS-ORT can obtain guaranteed training performance with a simpler tree structure and a better generalization ability in four hours.